The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks
Normalization methods play an important role in enhancing the performance of deep learning, yet their theoretical understanding remains limited. To elucidate the effectiveness of normalization theoretically, we quantify the geometry of the parameter space determined by the Fisher information matrix (FIM), which also corresponds to the local shape of the loss landscape under certain conditions. We analyze deep neural networks with random initialization, which are known to suffer from a pathologically sharp landscape when the network becomes sufficiently wide. We reveal that batch normalization in the last layer drastically decreases this pathological sharpness if the width and sample number satisfy a specific condition. In contrast, batch normalization in the middle hidden layers can hardly alleviate pathological sharpness in many settings. We also find that layer normalization cannot alleviate pathological sharpness either. Thus, we conclude that batch normalization in the last layer significantly contributes to decreasing the sharpness induced by the FIM.
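To make the notion of FIM-induced sharpness concrete, the following is a minimal numerical sketch, not the authors' code. It assumes a two-layer ReLU network with He-style Gaussian initialization, Gaussian random inputs, and an empirical FIM of the form F = (1/N) Σ_n Σ_k ∇f_k(x_n) ∇f_k(x_n)^T. Batch normalization in the last layer is simplified to mean subtraction over the batch, with no variance scaling. The function name `fim_top_eigenvalue` and all of its parameters are hypothetical. The largest eigenvalue of F is obtained from the Gram matrix of per-sample gradients, which shares its nonzero eigenvalues with F.

```python
import numpy as np

def fim_top_eigenvalue(width, n_samples=64, n_in=64, n_out=2,
                       mean_subtract=False, seed=0):
    """Largest eigenvalue of the empirical FIM of a random two-layer ReLU net.

    Assumed setup (illustrative only): u = W2 relu(W1 x + b1) + b2 with
    zero biases, and f = u, or f = u - batch_mean(u) if mean_subtract.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_in))
    W1 = rng.standard_normal((width, n_in)) * np.sqrt(2.0 / n_in)
    W2 = rng.standard_normal((n_out, width)) * np.sqrt(2.0 / width)

    pre = X @ W1.T                      # hidden pre-activations, shape (N, M)
    h = np.maximum(pre, 0.0)            # ReLU activations
    active = (pre > 0).astype(float)    # ReLU derivative

    # Gram matrix of gradients, indexed as (sample n, output k, sample m, output l).
    # Output-layer weights and bias contribute (h_n . h_m + 1) when k == l.
    gram_last = h @ h.T + 1.0
    # First-layer weights and bias: backpropagated signal times (x_n . x_m + 1).
    back = (active[:, None, :] * W2[None, :, :]).reshape(n_samples * n_out, width)
    gram_first = (back @ back.T).reshape(n_samples, n_out, n_samples, n_out)
    gram_first = gram_first * (X @ X.T + 1.0)[:, None, :, None]
    gram = gram_first + gram_last[:, None, :, None] * np.eye(n_out)[None, :, None, :]

    if mean_subtract:
        # Subtracting the batch mean of the output centers every gradient
        # over the sample index, on both sides of the Gram matrix.
        gram = gram - gram.mean(axis=0, keepdims=True)
        gram = gram - gram.mean(axis=2, keepdims=True)

    gram = gram.reshape(n_samples * n_out, n_samples * n_out)
    return np.linalg.eigvalsh(gram)[-1] / n_samples

for width in (256, 1024):
    plain = fim_top_eigenvalue(width)
    centered = fim_top_eigenvalue(width, mean_subtract=True)
    print(f"width={width}  plain={plain:.1f}  mean-subtracted={centered:.1f}")
```

In this toy setting the unnormalized eigenvalue should grow roughly in proportion to the width, while the mean-subtracted one should be markedly smaller. The exact values depend on the seed and on the ratio of width to sample number, so the sketch illustrates the qualitative effect only, not the precise condition derived in the paper.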
Reviews: The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks
This well-written paper is the latest in a series of works that study how signals propagate in random neural networks by analyzing the mean and variance of activations and gradients given random inputs and weights. The technical accomplishment can be considered incremental with respect to this series of works. However, while the techniques used are not new, the analysis leads to new insights on the use of batch/layer normalization. In particular, the analysis provides a close look at the mechanisms that lead to pathological sharpness in DNNs, showing that mean subtraction is the main ingredient in countering these mechanisms. While these claims would have to be verified in more complicated settings (e.g. with more complicated distributions on inputs and weights), it is an important first step to know that they hold for such simple networks.
Reviews: The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks
The paper is well written and analyzes how signals propagate in random neural networks. It does so by analyzing the mean and variance of activations and gradients, given random inputs and weights. The technical contributions are okay, and the analysis leads to new insights on the use of batch/layer normalization.
The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks
Karakida, Ryo, Akaho, Shotaro, Amari, Shun-ichi